Running R in the Cloud (Part 1)

R excels at a great number of analytical tasks. But the high-level functions, rich graphics, and other qualities that make R a likely choice come at a cost--namely, performance.

The good news for R lovers is that there are easy ways to speed up the operations you're running using Amazon EC2 and other cloud-based computing services.

This is a post about running R and RStudio Server on EC2.

The limitations of your laptop

R is a powerful, though admittedly esoteric, programming language. It's one of my favorites.

"R is a shockingly dreadful language for an exceptionally useful data analysis environment."
-Tim Smith, aRrgh: a newcomer's (angry) guide to data types in R

The feature-rich base R package, coupled with an interactive environment well suited for data exploration, has made R an extremely attractive framework for analysts, scientists, and researchers.

However, R has well-known memory limitations as a result of its holding all objects and variables in memory. This makes certain computationally intensive operations difficult or impossible. There are several interesting open source projects which address these and other limitations (Julia and pandas to name two).

Below, we'll explore how analysts using R can put EC2 to work.

Overview

We'll start by going over EC2 and its configuration using Ubuntu and RStudio Server. You can watch a video tutorial of all of this as well.

Running Rstudio on EC2 from Greg Lamp on Vimeo.

A little about EC2

Some things are too much for your laptop to handle. You've probably encountered more than one process that causes things to freeze up or the ddply progress bar to hang at 2 percent. Knowing how to use EC2 is fantastic for these types of situations.

Before you start...

To follow along, you'll need:

  1. An AWS (Amazon Web Services) account. To get set up, just follow these instructions.
  2. An SSH client. For Mac and Linux users, this means your terminal. For Windows users, I recommend PuTTY or the Chrome extension Secure Shell.

If you've never used a terminal before, don't worry. The commands we're going to run are super standard and are (almost) guaranteed to work if you run them in the right order.
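
If you're not sure whether your machine already has an SSH client, a quick check along these lines should tell you. This is a minimal sketch; the helper name "have" is just something made up for this example.

```shell
#!/bin/sh
# Quick check that an SSH client is available on this machine.

have() {
  # Succeeds when the named command can be found on the PATH.
  command -v "$1" >/dev/null 2>&1
}

if have ssh; then
  echo "ssh client found"
else
  echo "no ssh client on PATH; install one before continuing"
fi
```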

Choosing a Flavor

First things first. We need a server. Easy enough. Log in to your AWS console and click "EC2". Next click the blue button to "Launch Instance". Then select the "Classic Wizard".

Now we'll select a server. I like working with Ubuntu, so let's go with "Ubuntu Server 12.04.2 LTS 64 bit". Ubuntu is a pretty standard user-friendly Linux distro. Click continue.

Pick Your Size

Now we need to select the size of our instance. Let's select something fun--say an M3 Double Extra Large (m3.2xlarge). This bad boy has 8 cores and can stuff over 30 GB into memory. Not too shabby.

1, 2, skip a few...

Don't worry about anything on this screen. It's just summarizing the server's configuration details. Go ahead and click Continue.

Again, nothing to see here. Click Continue.

Keep moving.

Create a key pair

OK. On this screen we need to select our key pair. Your key pair is a bit like a username and password (but more secure).

If this is your first time using EC2 then you'll want to create and download a new key pair. Be sure to save this file somewhere safe. AWS only lets you download it once, so if you lose it, you're shit out of luck.

If you're working on a specific project, you might want a specific key pair just for that. Click Continue once you have a key pair saved to your local machine.

Setting Up Security

Your security group defines the ways in which your server can be accessed. Go ahead and create a new group and call it something like "analytics".

For the rules, under "Create a New Rule," select "Custom TCP Rule". Then type "8787" in the PORT field and click "Add Rule". This opens the port where RStudio runs (8787) and allows you to access RStudio from the browser.

Next select "SSH" under new rules and click "Add Rule" again. This will let you access your server from the terminal.

Click Continue.
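
If you'd rather script this step, the same two rules can be sketched with the AWS CLI. This assumes the AWS CLI is installed and configured (nothing else in this post requires it) and that the group lives in your default VPC. The 0.0.0.0/0 range, which lets any address connect, is an assumption for illustration; narrow it to your own IP if you can. The script only prints the commands unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch: create the "analytics" security group from the command line.
# Assumes the AWS CLI is installed and configured and that the group
# lives in your default VPC. By default the commands are only printed;
# set DRY_RUN=0 to actually run them.

GROUP_NAME="${GROUP_NAME:-analytics}"
ALLOWED_CIDR="${ALLOWED_CIDR:-0.0.0.0/0}"   # assumption: open to any address

run() {
  # Print the command in dry-run mode, execute it otherwise.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

open_port() {
  # $1 = TCP port to open to ALLOWED_CIDR
  run aws ec2 authorize-security-group-ingress \
    --group-name "$GROUP_NAME" --protocol tcp --port "$1" --cidr "$ALLOWED_CIDR"
}

create_analytics_group() {
  run aws ec2 create-security-group \
    --group-name "$GROUP_NAME" --description "R and RStudio Server access"
  open_port 8787   # RStudio Server
  open_port 22     # SSH
}

create_analytics_group
```

The dry-run output is a preview for reading, not for copy and paste, since quoting is lost when the commands are echoed.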

3, 2, 1, Launch!

Last page. Make sure everything looks good then hit Launch! While your server is starting up, grab a cup of coffee or otherwise occupy yourself. It typically takes 2-3 minutes to start up.

Using your Server

Head back to the instances page and look for your server. Once the green light is illuminated we're ready to rock and roll.

Click on your server and take a look at the Public DNS. It'll be something like "ec2-23-23-44-106.compute-1.amazonaws.com". Copy that to your clipboard and then open up your SSH client.

Logging In

Navigate to the folder that holds your key pair (.pem) file and type:

ssh -i {your key pair}.pem ubuntu@ec2-23-23-44-106.compute-1.amazonaws.com

You may get an error like "WARNING: UNPROTECTED PRIVATE KEY FILE!" (you'll see this if you watch the video). Don't worry, you just need to change the permissions on your .pem file so that only you can read it. You can do this by executing:

chmod 600 {your key pair}.pem

Check out this question on Stack Overflow for more on this.
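
If you expect to log in often, the two commands can be wrapped into a small script. This is a rough sketch: KEY_FILE is a made-up file name and EC2_HOST is the example address from above, so swap in your own values. Nothing happens until you call "connect".

```shell
#!/bin/sh
# Sketch: lock down the key file, then open the SSH session.
# KEY_FILE and EC2_HOST are placeholders; substitute your own values.

KEY_FILE="${KEY_FILE:-my-key-pair.pem}"   # hypothetical key file name
EC2_HOST="${EC2_HOST:-ec2-23-23-44-106.compute-1.amazonaws.com}"

secure_key() {
  # ssh refuses private keys that other users can read, so keep
  # read/write access for the owner only.
  chmod 600 "$1"
}

connect() {
  # Fix the permissions first, then log in as the default "ubuntu" user.
  secure_key "$KEY_FILE" && ssh -i "$KEY_FILE" "ubuntu@$EC2_HOST"
}

# Run "connect" once KEY_FILE and EC2_HOST point at your key and server.
```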

You'll also get a prompt asking whether you trust the server, since this is the first time you're connecting to it. Just type "yes".

Great! You should now be logged in to your EC2 server, and your terminal should show a prompt that starts with "ubuntu@".

Installing R and RStudio

Ubuntu makes it really easy to get everything set up. It's just a series of 8 commands you need to execute (in order). You start by adding yourself as a user (or anyone else you want to have access to the machine), then updating your Linux box and then installing R. Once you have R working, you can install RStudio using gdebi, a tool that can install .deb packages.
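
Here is one way the sequence described above might look. Treat it as a sketch of the same steps rather than the exact eight commands: NEW_USER is a made-up user name, and RSTUDIO_DEB_URL is a placeholder you need to fill in with the .deb link from the RStudio Server download page. The script prints the commands by default; run it on the server with DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of the install sequence: add a user, update, install R,
# then install RStudio Server with gdebi.
# NEW_USER and RSTUDIO_DEB_URL are placeholders you must supply.
# Commands are printed by default; set DRY_RUN=0 to run them.

NEW_USER="${NEW_USER:-analyst}"                           # hypothetical user name
RSTUDIO_DEB_URL="${RSTUDIO_DEB_URL:-PASTE_DEB_URL_HERE}"  # from the RStudio Server download page

run() {
  # Print the command in dry-run mode, execute it otherwise.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

install_r_and_rstudio() {
  # Each step runs only if the previous one succeeded.
  # adduser prompts for the password you'll later use to log in.
  run sudo adduser "$NEW_USER" &&
  # Refresh the package lists before installing anything.
  run sudo apt-get update &&
  run sudo apt-get install -y r-base &&
  # gdebi-core provides the gdebi command.
  run sudo apt-get install -y gdebi-core &&
  run wget -O rstudio-server.deb "$RSTUDIO_DEB_URL" &&
  # gdebi installs the .deb along with its dependencies.
  run sudo gdebi rstudio-server.deb
}

install_r_and_rstudio
```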

Connecting to RStudio

Now the fun part. Open your browser on your local machine and go to this URL: http://ec2-23-23-44-106.compute-1.amazonaws.com:8787/. Assuming everything went according to plan, you should see the login screen. Enter the username/password you created earlier.
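
If the page doesn't load, it can help to check from your terminal whether anything is answering on port 8787. A minimal sketch, assuming curl is installed on your local machine; EC2_HOST is the example address used throughout, and the function names are made up for this example.

```shell
#!/bin/sh
# Sketch: build the RStudio Server URL and check whether it responds.
# EC2_HOST is a placeholder; use your own Public DNS.

EC2_HOST="${EC2_HOST:-ec2-23-23-44-106.compute-1.amazonaws.com}"

rstudio_url() {
  # $1 = Public DNS of the server; RStudio Server listens on port 8787.
  echo "http://$1:8787/"
}

check_rstudio() {
  # -s hides progress, -o discards the page body, -m caps the wait
  # at 10 seconds, and -w prints only the HTTP status code.
  curl -s -o /dev/null -m 10 -w '%{http_code}\n' "$(rstudio_url "$1")"
}

echo "Open $(rstudio_url "$EC2_HOST") in your browser"
# To test the connection, run: check_rstudio "$EC2_HOST"
```

A status of 000 from check_rstudio means no connection was made, which usually points back to the security group rule for port 8787 or to the server not running yet.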

Once you've logged in, everything should be pretty much business as usual.

Getting Data into RStudio on EC2

This is a common question. RStudio Server makes it easy: click on the "Upload" button in the Files pane (lower right by default). It'll take care of the rest.
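
For files too big to push through the browser comfortably, copying them over SSH with scp is another option. This is a sketch under a few assumptions: KEY_FILE is a made-up file name, EC2_HOST is the example address from earlier, and the file lands in the home directory of the "ubuntu" user rather than the user you log into RStudio with.

```shell
#!/bin/sh
# Sketch: copy a local file to the server with scp instead of the
# browser upload. KEY_FILE and EC2_HOST are placeholders.

KEY_FILE="${KEY_FILE:-my-key-pair.pem}"   # hypothetical key file name
EC2_HOST="${EC2_HOST:-ec2-23-23-44-106.compute-1.amazonaws.com}"

remote_home() {
  # scp destination: the ubuntu user's home directory on the server.
  echo "ubuntu@$EC2_HOST:~/"
}

upload() {
  # $1 = path to a local file
  scp -i "$KEY_FILE" "$1" "$(remote_home)"
}

# Example: upload big-dataset.csv
```

From R you would then read the file by its full path under /home/ubuntu, assuming the permissions there let your RStudio user read it.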

Done and Done.

You're now running your own cloud-based RStudio. This is a great tool for problems your laptop just can't handle. One thing to remember: shut down your server when you're not using it. Amazon will continue to charge you as long as your server is running. You can always start it back up (RStudio will automatically restart).
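
Stopping and starting can also be done from the terminal. A minimal sketch, assuming the AWS CLI is installed and configured; INSTANCE_ID is a placeholder for the ID shown on your instances page. As written, the script only prints the command; set DRY_RUN=0 to run it for real.

```shell
#!/bin/sh
# Sketch: stop (and later restart) the server from the command line.
# Assumes the AWS CLI is installed and configured.
# INSTANCE_ID is a placeholder; copy yours from the instances page.

INSTANCE_ID="${INSTANCE_ID:-i-xxxxxxxx}"   # hypothetical instance ID

run() {
  # Print the command in dry-run mode, execute it otherwise.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

stop_server() {
  run aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
}

start_server() {
  run aws ec2 start-instances --instance-ids "$INSTANCE_ID"
}

stop_server
```

Keep in mind that the Public DNS usually changes after a stop and start, so copy the new address before you SSH in or open RStudio again.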

A follow-up post will go over some cool stuff you can do using this setup like running plyr functions in parallel.